Error reducing sampling in reinforcement learning

Author

  • Bruno Scherrer
Abstract

In reinforcement learning, an agent collects information by interacting with an environment and uses it to derive a behavior. This paper focuses on efficient sampling, that is, the problem of choosing the interaction samples so that the corresponding behavior tends quickly to the optimal behavior. Our main result is a sensitivity analysis relating the choice of sampling any state-action pair to the decrease of an error bound on the optimal solution. We derive two new model-based algorithms. Simulations demonstrate a quicker convergence (in the sense of the number of samples) of the value function to the real optimal value function.

Introduction

In reinforcement learning, an agent collects information by interacting with an environment and uses it to derive a behavior. This paper focuses on efficient sampling, that is, the problem of choosing the interaction samples so that the corresponding behavior tends quickly to the optimal behavior. The problem we consider here is different from the well-known exploration-exploitation dilemma (Kumar, 1985), in which an agent wants to collect information while optimizing its interaction. In this paper we consider the case where the agent wants to find the samples that will allow it to tend to the optimal behavior with as few samples as possible, while not caring about its exploration performance. A typical setting where the present work might be useful is when the agent has a practice epoch at its disposal during which its performance does not matter. For instance, it might be a computer game player practicing before a competition, like the famous backgammon TD-player (Tesauro, 1995), or a robot which learns in a non-harmful environment (e.g. on Earth) before actually going to a similar risky environment (e.g. on Mars) (Bernstein et al., 2001). Another case where performance during training is irrelevant is neuro-dynamic programming (Bertsekas and Tsitsiklis, 1996), where reinforcement learning methods are used to solve very large MDPs in simulation. Tackling this sampling issue is all the more relevant when sampling has a high cost (in the robot example, interacting with the world costs a lot of time and energy). In all these problems, we want the computed behavior to tend to the optimal behavior quickly with the number of samples.

Our approach is the following: we first derive a confidence bound on the optimal value function, then we make a sensitivity analysis relating the choice of sampling any state-action pair to the tightening of this confidence bound. Our main result is Theorem 3, which predicts how sampling a given state-action pair will tighten the confidence bound. With such an analysis, an agent can, step after step, choose to sample the state-action pair that will tighten its confidence on its behavior quality the most. Going even further, section 5 introduces an Error MDP, whose optimal policy corresponds to the best sampling strategy for tightening the confidence bound in the long term. Most work on sampling analysis in reinforcement learning (Bertsekas and Tsitsiklis, 1996; Even-Dar et al., 2003; Kearns and Koller, 1999; Kearns and Singh, 1998) relies on the maximum ($L_\infty$) norm. Though sufficient for many convergence results, $L_\infty$ bounds are often disappointing, as they do not give a precise picture of where and why the approximation is bad. In this paper, we provide a confidence bound with respect to the $L_1$ norm.
Such a bound gives us a more precise picture of where in the state-action space, and by how much, sampling errors on the parameters R and T incur a global cost on the value function.

The paper is organized as follows. Section 1 presents the core of reinforcement learning: we briefly present the theory of optimal control with Markov decision processes (MDPs) and the certainty equivalence method for reinforcement learning. Section 2 reviews recent results for analyzing approximations in the MDP framework. In section 3, we apply this analysis to the reinforcement learning problem and prove the key theorem of this paper: Theorem 3 shows how to estimate the effect of sampling a particular state-action pair on the approximation error. Section 4 then describes two new algorithms that are based on this key theorem. Section 5 experimentally illustrates and discusses the results of these algorithms. Finally, section 6 provides a discussion of the related literature.

1 The model

Markov decision processes (MDPs) (Puterman, 1994) provide the theoretical foundations of problems that challenge researchers in artificial intelligence and operations research. These problems include optimal control and reinforcement learning (Sutton and Barto, 1998). A Markov decision process is a controlled stochastic process satisfying the Markov property, with rewards (numerical values) assigned to state-action pairs. Formally, an MDP M is a four-tuple $\langle S, A, T, R \rangle$ where S is the state space, A is the action space, T is the transition function and R is the reward function. T is the state-transition probability distribution conditioned on the action; for all state-action pairs (s, a) and possible subsequent states s':

$$T(s, a, s') \stackrel{\text{def}}{=} \mathbb{P}(s_{t+1} = s' \mid s_t = s,\, a_t = a).$$

$R(s, a) \in \mathbb{R}$ is the random variable which corresponds to the instantaneous reward for taking action $a \in A$ in state $s \in S$. We assume throughout this paper that R is bounded; then, without loss of generality, we also assume that it is contained in the interval $(0, R_{\max})$.

Given an MDP $\langle S, A, T, R \rangle$, the optimal control problem consists in finding a sequence of actions $(a_0, a_1, a_2, \dots)$ that maximises the expected long-term discounted sum of rewards

$$\mathbb{E}\!\left[\left.\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\right|\, s_0 = s,\, a_t\right],$$

where the expectation is over the runs of the Markov chain induced by $(a_0, a_1, a_2, \dots)$, and $\gamma \in (0, 1)$ is a discount factor. A well-known fundamental result is that an optimal sequence of actions can be derived from a deterministic function $\pi : S \to A$, called a policy, which prescribes which action to take in every state. The value function of a policy $\pi$ at state s is the expected long-term discounted amount of rewards if one follows policy $\pi$ from state s:

$$V^{\pi}(s) \stackrel{\text{def}}{=} \mathbb{E}\!\left[\left.\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\right|\, s_0 = s,\, a_t = \pi(s_t)\right],$$

where the expectation is over the runs of the Markov chain induced by $\pi$; it satisfies, for all states s,

$$V^{\pi}(s) = \mathbb{E}[R(s, \pi(s))] + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{\pi}(s').$$

The Q-function of a policy $\pi$ for state-action pair (s, a) is the expected long-term discounted amount of rewards if one takes action a from state s and then follows the policy $\pi$:

$$Q^{\pi}(s, a) \stackrel{\text{def}}{=} \mathbb{E}\!\left[\left.\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\right|\, s_0 = s,\, a_t = \begin{cases} a & \text{if } t = 0 \\ \pi(s_t) & \text{otherwise} \end{cases}\right],$$

and it satisfies

$$Q^{\pi}(s, a) = \mathbb{E}[R(s, a)] + \gamma \sum_{s'} T(s, a, s')\, V^{\pi}(s').$$

Given these notations, the optimal control problem amounts to finding an optimal policy $\pi^*$ whose value $V^*$, called the optimal value function, is the greatest for all states: $\forall s \in S,\ V^*(s) = \max_{\pi} V^{\pi}(s)$.
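To make these definitions concrete, here is a minimal sketch in Python (not from the paper: the two-state MDP, its numbers, and the function names are illustrative assumptions) that evaluates a fixed policy by solving the linear Bellman equations and then extracts the greedy action from the resulting Q-values.

```python
import numpy as np

# Illustrative two-state, two-action MDP; all numbers are made up for this example.
# T[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a);  R[s, a] = E[R(s, a)].
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def policy_value(policy, T, R, gamma):
    """Solve the linear Bellman equations V = R_pi + gamma * T_pi V exactly."""
    n = T.shape[0]
    T_pi = np.array([T[s, policy[s]] for s in range(n)])   # row s: T(s, pi(s), .)
    R_pi = np.array([R[s, policy[s]] for s in range(n)])   # entry s: E[R(s, pi(s))]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def q_values(V, T, R, gamma):
    """Q(s, a) = E[R(s, a)] + gamma * sum_{s'} T(s, a, s') V(s')."""
    return R + gamma * T @ V

pi = np.array([0, 1])                                 # a deterministic policy pi: S -> A
V_pi = policy_value(pi, T, R, gamma)                  # V^pi for this policy
greedy = q_values(V_pi, T, R, gamma).argmax(axis=1)   # greedy (improved) action per state
print(V_pi, greedy)
```

Alternating such an exact evaluation step with a greedy improvement step is the Policy Iteration scheme recalled below.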
Such an optimal policy exists, and its value function $V^*$ is the unique fixed point of a contraction mapping, so that for all states s:

$$V^*(s) = \max_{a} \left( \mathbb{E}[R(s, a)] + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right).$$

The corresponding optimal Q-function

$$Q^*(s, a) \stackrel{\text{def}}{=} \mathbb{E}[R(s, a)] + \gamma \sum_{s'} T(s, a, s')\, V^*(s')$$

is particularly interesting, as it enables us to derive a deterministic optimal policy $\pi^*$ as follows: $\pi^*(s) = \arg\max_{a} Q^*(s, a)$. A standard algorithm for solving optimal control is Policy Iteration (Puterman, 1994), which converges to the optimal solution in a finite number of iterations.

The reinforcement learning problem is a variant of optimal control in which the MDP parameters (R and T) are initially unknown and must therefore be estimated from sample experiments (Sutton and Barto, 1998). While optimal control only involves planning, reinforcement learning involves both learning (estimation of the parameters) and planning, and is therefore a slightly more difficult problem. A standard and natural solution to this problem, known as the certainty equivalence method (Kumar and Varaiya, 1986), consists in estimating the unknown parameters R and T, and then deriving a policy from these estimates. Let $\#(s, a)$ be the number of times one has taken action a in state s, $\#(s, a, s')$ the number of times one arrived in state s' after having taken action a in state s, and $\Sigma_R(s, a)$ the cumulative amount of rewards received when taking action a in state s. The idea of the certainty equivalence method is to solve the MDP $\hat{M} = \langle S, A, \hat{T}, \hat{R} \rangle$ where

$$\hat{R}(s, a) \stackrel{\text{def}}{=} \frac{\Sigma_R(s, a)}{\#(s, a)} \quad \text{and} \quad \hat{T}(s, a, s') \stackrel{\text{def}}{=} \frac{\#(s, a, s')}{\#(s, a)} \qquad (1)$$

are the maximum-likelihood estimates of R and T. After a finite number of samples, choosing the optimal policy given this empirical model is clearly an approximation. The next sections provide an explicit analysis of this approximation.

2 The approximation error

In this section, we review some recent general results about approximation in MDPs; we will apply them to the reinforcement learning case in the next section. Recall that, in the discounted optimal control problem, we want to find the optimal value function $V^*$, which satisfies for all states s: $V^*(s) = \max_{a} [B_a V^*](s)$, where $B_a$, often referred to as the Bellman operator, returns, for any real-valued function W on S and any action a, a new real-valued function of s:

$$[B_a W](s) \stackrel{\text{def}}{=} \mathbb{E}[R(s, a)] + \gamma \sum_{s'} T(s, a, s')\, W(s').$$

Consider that, instead of using this Bellman operator $B_a$, we use a slightly different Bellman operator $\hat{B}_a$, which, for any real-valued function W on S and any action a, returns the following function of s:

$$[\hat{B}_a W](s) \stackrel{\text{def}}{=} \hat{R}(s, a) + \gamma \sum_{s'} \hat{T}(s, a, s')\, W(s').$$

We shall call $\hat{B}_a$ the approximate Bellman operator, as it is based on the approximate parameters $\hat{R}$ and $\hat{T}$. For any policy $\pi$, let $\hat{V}^{\pi}$ be the value of the policy based on these approximate parameters. Similarly, let $\hat{V}^*$ be the corresponding optimal value function and $\hat{\pi}^*$ the corresponding optimal policy. In the remainder of this section, we show how to analyze the error due to using $\hat{B}_a$ instead of $B_a$. Suppose $e(s, a)$ is an upper bound on the error made when using the approximate parameters to operate on the real optimal value function $V^*$: …
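Equation (1) is just a ratio of counts, so a hedged sketch of the certainty-equivalence estimation step looks like the following (the data layout, counter arrays, and function name are assumptions of this example, not code from the paper):

```python
import numpy as np

def certainty_equivalence_model(transitions, n_states, n_actions):
    """Build the empirical MDP <S, A, T_hat, R_hat> of equation (1)
    from a list of observed (s, a, r, s_next) samples."""
    count_sa  = np.zeros((n_states, n_actions))            # #(s, a)
    count_sas = np.zeros((n_states, n_actions, n_states))  # #(s, a, s')
    sum_r     = np.zeros((n_states, n_actions))            # Sigma_R(s, a)

    for s, a, r, s_next in transitions:
        count_sa[s, a] += 1
        count_sas[s, a, s_next] += 1
        sum_r[s, a] += r

    visited = count_sa > 0                                  # only estimate sampled pairs
    R_hat = np.zeros((n_states, n_actions))
    T_hat = np.zeros((n_states, n_actions, n_states))
    R_hat[visited] = sum_r[visited] / count_sa[visited]
    T_hat[visited] = count_sas[visited] / count_sa[visited][:, None]
    return R_hat, T_hat

# e.g. transitions = [(0, 1, 0.0, 1), (0, 1, 1.0, 0), ...] collected during a practice epoch
```

Solving the empirical MDP $\hat{M}$ (for instance by Policy Iteration) and acting greedily with respect to its optimal Q-function yields the certainty-equivalence policy whose approximation error section 2 analyzes.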

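The page is truncated just as the per-pair error term $e(s, a)$ is introduced. As a hedged sketch of the quantity that the last sentence describes in words — the gap between the exact and approximate Bellman operators applied to $V^*$ — one could compute the following (the function names, array shapes, and the availability of the true $V^*$, which is only realistic in simulation, are assumptions of this example):

```python
import numpy as np

def bellman_apply(R, T, W, gamma):
    """[B_a W](s) = E[R(s, a)] + gamma * sum_{s'} T(s, a, s') W(s'), for all (s, a)."""
    return R + gamma * T @ W

def bellman_gap(R, T, R_hat, T_hat, V_star, gamma):
    """Per state-action gap between the approximate and exact Bellman operators
    applied to the optimal value function V*: |[B_hat_a V*](s) - [B_a V*](s)|."""
    exact  = bellman_apply(R, T, V_star, gamma)
    approx = bellman_apply(R_hat, T_hat, V_star, gamma)
    return np.abs(approx - exact)    # any e(s, a) upper-bounding this gap is admissible

# A sampling strategy in the spirit of the paper directs new samples toward the
# state-action pairs whose contribution to such an error bound is largest.
```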

Similar articles

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...


Construction of approximation spaces for reinforcement learning

Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation space should encode diffusion distances and that slow feature analysis (SFA) constructs such s...


A Reinforcement-Learning Approach to Color Quantization

Color quantization is a process of sampling three-dimensional color space (e.g. RGB) to reduce the number of colors in a color image. By reducing to a discrete subset of colors known as a color codebook or palette, each pixel in the original image is mapped to an entry according to these palette colors. In this paper, a reinforcement-learning approach to color image quantization is proposed. Fu...


Modeling the Adaptation of Search Termination in Human Decision Making

We study how people terminate their search for information when making decisions in a changing environment. In 3 experiments, differing in the cost of search, participants made a sequence of 2-alternative decisions, based on the information provided by binary cues they could search. Whether limited or extensive search was required to maintain accurate decisions changed across the course of the ...


Reducing state space exploration in reinforcement learning problems by rapid identification of initial solutions and progressive improvement of them

Most existing reinforcement learning methods require exhaustive state space exploration before converging towards a problem solution. Various generalization techniques have been used to reduce the need for exhaustive exploration, but for problems like maze route finding these techniques are not easily applicable. This paper presents an approach that makes it possible to reduce the need for stat...


Feature Engineering for Predictive Modeling using Reinforcement Learning

Feature engineering is a crucial step in the process of predictive modeling. It involves the transformation of given feature space, typically using mathematical functions, with the objective of reducing the modeling error for a given target. However, there is no well-defined basis for performing effective feature engineering. It involves domain knowledge, intuition, and most of all, a lengthy p...



Publication date: 2006